logo

1 Sample metrics

1.1 Alignment summary

The provided genome measures 32803248nt, of which 13608910nt were predicted as simple or transposable repetitive elements (41.49% of the genome).
The genome sequence file with the repetitive positions lower-cased is available from here.
The genomic coordinates of the repetitive elements are available from here.
Sequencing reads were aligned to the provided genome using BWA[1].
Table 1 reports the sequencing read alignment statistics computed with Picard[2]
See https://broadinstitute.github.io/picard/picard-metric-definitions.html#AlignmentSummaryMetrics for metrics definition

Table 1: Alignment statistics
CATEGORY TOTAL_READS PF_READS_ALIGNED PCT_PF_READS_ALIGNED PF_ALIGNED_BASES PF_MISMATCH_RATE PF_INDEL_RATE MEAN_READ_LENGTH READS_ALIGNED_IN_PAIRS PCT_READS_ALIGNED_IN_PAIRS PF_READS_IMPROPER_PAIRS PCT_PF_READS_IMPROPER_PAIRS STRAND_BALANCE PCT_CHIMERAS PCT_ADAPTER
FIRST_OF_PAIR 14641170 14347825 0.979964 1543165315 0.001961 0.000300 108 14309341 0.997318 101001 0.007039 0.500309 0.001614 0.002372
SECOND_OF_PAIR 14663651 14314351 0.976179 1532962669 0.003626 0.000294 108 14309341 0.999650 67527 0.004717 0.499734 0.001531 0.002351
PAIR 29304821 28662176 0.978070 3076127984 0.002791 0.000297 108 28618682 0.998483 168528 0.005880 0.500021 0.001573 0.002361

1.2 Read insert size

Figure 1 demonstrates the insert size distributions for read pairs with Forward-Reverse (FR), Reverse-Forward (RF) and TANDEM orientation.
The statistics relative to the pair-end insert size were computed with Picard and reported in Table 2
See https://broadinstitute.github.io/picard/picard-metric-definitions.html#InsertSizeMetrics for metrics definition.

Figure 1: Insert size distribution
Table 2: Insert size statistics
MEDIAN_INSERT_SIZE MODE_INSERT_SIZE MEDIAN_ABSOLUTE_DEVIATION MIN_INSERT_SIZE MAX_INSERT_SIZE MEAN_INSERT_SIZE STANDARD_DEVIATION READ_PAIRS WIDTH_OF_10_PERCENT WIDTH_OF_20_PERCENT WIDTH_OF_30_PERCENT WIDTH_OF_40_PERCENT WIDTH_OF_50_PERCENT WIDTH_OF_60_PERCENT WIDTH_OF_70_PERCENT WIDTH_OF_80_PERCENT WIDTH_OF_90_PERCENT WIDTH_OF_95_PERCENT WIDTH_OF_99_PERCENT
FR 509 473 81 2 1819156 524.1213 133.8351 14259259 31 61 93 127 163 203 253 319 433 553 833
RF 3954 3828 3493 1 2504221 8245.2586 9374.6391 9439 351 667 1457 4721 6987 11423 21795 45555 96961 148349 490505
TANDEM 74940 38 66820 1 2245262 136546.5588 174137.2072 2301 14741 41645 79177 107561 133641 148691 430591 479553 1032863 1984679 2905381

1.3 Read duplicates

Duplicated reads (i.e. originating from a single fragment of DNA) were detected with Picard.
Table 3 reports read duplicates statistics.
See https://broadinstitute.github.io/picard/picard-metric-definitions.html#DuplicationMetrics for metrics definition

Table 3: Read duplicates statistics
UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED SECONDARY_OR_SUPPLEMENTARY_RDS UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
77885 14513542 33143 642645 34391 204201 0 0.015214 510924154

The saturation curve in Figure 2 shows the return of interest (ROI) if additional sequencing data was produced from the same library.
For sequencing at higher depth the estimated gains in library complexity (i.e. the number of unique molecules in the sequencing library) does not grow linearly since more and more of the reads are duplicates.
The left most dot indicates the current ROI.
See https://broadinstitute.github.io/picard/faq.html for more info regarding the histogram produced by MarkDuplicates.

Figure 2: Return of investment for sequencing the library at higher coverage

1.4 Genome coverage

1.4.1 Sequencing depth

Table 4 reports the median sequencing coverage (MEDIANCOV) for individual chromosomes.
MEDIANCOVminus2MAD and MEDIANCOVplus2MAD reflect the sequencing coverage variability, indicating the median sequencing coverage -/+ 2 mean absolute deviations respectively.

Table 4: Chromosome coverage
CHR MEDIANCOV MEDIANCOVminus2MAD MEDIANCOVplus2MAD
LinJ01 83 65 101
LinJ02 92 66 118
LinJ03 78 62 94
LinJ04 88 70 106
LinJ05 130 106 154
LinJ06 113 93 133
LinJ07 89 71 107
LinJ08 88 66 110
LinJ09 116 94 138
LinJ10 87 67 107
LinJ11 88 68 108
LinJ12 81 61 101
LinJ13 80 62 98
LinJ14 76 58 94
LinJ15 92 72 112
LinJ16 94 76 112
LinJ17 79 61 97
LinJ18 80 62 98
LinJ19 77 59 95
LinJ20 123 101 145
LinJ21 79 63 95
LinJ22 77 55 99
LinJ23 95 75 115
LinJ24 79 63 95
LinJ25 80 64 96
LinJ26 97 79 115
LinJ27 77 57 97
LinJ28 79 63 95
LinJ29 79 61 97
LinJ30 81 65 97
LinJ31 177 143 211
LinJ32 81 65 97
LinJ33 80 62 98
LinJ34 77 59 95
LinJ35 90 72 108
LinJ36 79 63 95

1.4.2 Binning

The genome was divided in contiguous bins spanning 300 nucleotides. For each bin GIP:

  • Measured the sequencing coverage of each nucleotide
  • Computed the mean bin coverage
  • Normalized by the median chromosome coverage
  • Computed the mean MAPQ score of the reads aligning to the bin

The median of the genomic bin intervals is 84.9967.

1.4.3 Somy score normalization

To calculate the somy score, the mean sequencing coverage of each genomic bin with mean MAPQ score > 50 was normalized by the median of all bins and multiplied by two.
Figure 3 and Figure 4 are two alternative somy score distribution visualizations.

Figure 3: Boxplot somy score distribution
Figure 4: Density plot, somy score distributio

The genome sequencing coverage density is available in bigWig format from here.
BigWig file format is compatible with genome browsers such as IGV[3].
A description of the bigWig format is available from https://genome.ucsc.edu/goldenPath/help/bigWig.html

2 Genomic bins coverage

Genomic bins with mean MAPQ score > 50 were evaluated to identify significantly amplified or depleted regions.
Considering an adjusted p-value threshold of 0.001 and considering possible user-defined additional coverage cut-offs, a total of 351 genomic bins were deemed to be either significantly enriched or depleted.

Figure 5: Genomic bin coverage overview
Figure 6: Genomic bin coverage in separate chromosomes

Significant bins can be found here
Collapsing significant adjacent bins into larger regions (segments) and averaging the bin coverage resulted in 42 amplified regions and 3 depleted regions available from here

Table 5: Example of significantly amplified segments
chr start end score segmentID
LinJ27 1173601 1175100 12.06 segment_30
LinJ31 149101 153900 6.21 segment_36
LinJ22 775801 776400 5.31 segment_26
LinJ12 417901 418800 3.99 segment_18
LinJ22 780601 781200 3.98 segment_27
Table 6: Example of significantly depleted segments
chr start end score segmentID
LinJ34 825001 825900 0.47 segment_39
LinJ12 378901 379200 0.49 segment_14
LinJ12 462601 462900 0.49 segment_22

3 Gene coverage

The gene sequencing coverage was evaluated by

  • measuring the sequencing coverage of each nucleotide
  • computing the mean gene coverage excluding N positions
  • normalizing by the median chromosome coverage

The MAPQ score of the reads aligning to each gene was also evaluated. Genes with mean MAPQ score > 50 were evaluated to identify significantly amplified or depleted genes. A total of 25 genes resulted significantly amplified, and a total of 0 significantly depleted

Figure 7: Overview of chromosomes embedding significant copy number variant genes

Table 7: Example of significantly amplified genes
gene_id locus meanCoverage normalizedMeanCoverage MAPQ annotation
LINF_310009800 LinJ31:152089-152688 1155.570 6.528644 56.68084 Amastin_surface_glycoprotein_-_putative
LINF_340038500 LinJ34:1411292-1413058 263.310 3.419610 59.97573 hypothetical_protein_-_conserved
LINF_120014700 LinJ12:452302-453693 213.508 2.635901 54.55921 surface_antigen_protein_2_-_putative
LINF_080017600 LinJ08:505615-508179 229.335 2.606079 59.99312 protein_kinase_-_putative
LINF_080017300 LinJ08:497982-498281 225.482 2.562295 59.99312 hypothetical_protein_-_conserved

The table with the sequencing coverage of significantly amplified, depleted or non-varying genes can be found here.

4 Single nucleotide variants

Single nucleotide variant (SNV) analysis include

  • predicting the SNVs with freebayes[4]
  • applying custom quality filters limiting the number of predictions to a number of high quality predictions
  • predicting the SNV effect with snpEff[5]
  • computing the dN/dS ratio
Table 8: SNV statistics
metric count
total SNVs 25158
intergenic 15032
genic 10126
tot genes with SNVs 3587
mean number of SNVs per gene 2.82297184276554
genes with dN/dS ratio > 1 1681
genes with dN/dS ratio < 1 1382
genes with dN/dS ratio == 1 524
tot synonimous variations 4691
tot non-synonimous variations 5435
Table 9: SNVs examples
chr position freq AO totDepth ref_alt context EFF gene_annotation
LinJ31 146527 1 196 196 C_T ACTTGCTCGAC NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|aGc/aAc|S483N|EXON_LinJ31_145800_147974|||LINF_310009700|1) cytoskeleton-associated_protein_CAP5.5_-_putative
LinJ31 130288 1 182 182 C_A AAAAACCCCGA NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gGt/gTt|G206V|EXON_LinJ31_128094_130904|||LINF_310009300|1) calpain-like_protein_-_putative
LinJ31 146398 1 177 177 T_C AGTACTGGGTG NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cAg/cGg|Q526R|EXON_LinJ31_145800_147974|||LINF_310009700|1) cytoskeleton-associated_protein_CAP5.5_-_putative
LinJ31 147804 1 175 175 A_T ATGGCATCAAT NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|gaT/gaA|D57E|EXON_LinJ31_145800_147974|||LINF_310009700|1) cytoskeleton-associated_protein_CAP5.5_-_putative
LinJ31 209222 1 172 172 G_A GGAGAGCACGG NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Ctc/Ttc|L157F|EXON_LinJ31_208161_209690|||LINF_310011400|1) amino_acid_transporter_aATP11_-_putative

The full set of SNVs predictions in VCF format is available from here. The filtered set of SNVs predictions in VCF format is available from here.
A tabulated version of the filtered set is available from here.

Table 10: dNdS comparison examples
gene dN dS ratio difference annotation
LINF_080008400 24 5 4.800 19 hypothetical_protein_-_conserved
LINF_160018200 18 4 4.500 14 hypothetical_protein_-_conserved
LINF_310035400 15 2 7.500 13 phosphatidylinositol-4-phosphate_5-kinase-like_protein
LINF_170010700 14 3 4.667 11 hypothetical_protein_-_conserved
LINF_300017200 11 1 11.000 10 hypothetical_protein_-_conserved

The table including the dNdS analysis of the filtered set is available from here.

Figure 8: SNV density per chromosome
Figure 9: VRF distribution
Figure 10: SNV position vs VRF
Figure 11: SNV depth vs VRF
Figure 12: variant types

5 Structural variants

delly[6] was used to detect structural variants (SVs) based of read pairing orientation and split reads information.
The genomic SVs that were evaluated include

  • inversions
  • translocations
  • duplications
  • deletions

The file in VCF format with delly's predictions is available from here.
Downstream filters were applied on the delly's output to reduce the number of false positives.
The filtered output is available from here.

Figure 13: Structural varaints analysis

6 References

[1] Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009 Jul 15;25(14):1754-60. doi:10.1093/bioinformatics/btp324. Epub 2009 May 18. PMID: 19451168; PMCID: PMC2705234.
[2] "Picard Toolkit". Broad Institute, http://broadinstitute.github.io/picard.
[3] Robinson JT, Thorvaldsdóttir H, Winckler W, Guttman M, Lander ES, Getz G,Mesirov JP. Integrative genomics viewer. Nat Biotechnol. 2011 Jan;29(1):24-6. doi: 10.1038/nbt.1754. PMID: 21221095; PMCID: PMC3346182.
[4] Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907 [q-bio.GN] 2012
[5] "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3.", Cingolani P, Platts A, Wang le L, Coon M, Nguyen T, Wang L, Land SJ, Lu X, Ruden DM. Fly (Austin). 2012 Apr-Jun;6(2):80-92. PMID: 22728672
[6] Rausch T, Zichner T, Schlattl A, Stütz AM, Benes V, Korbel JO. DELLY:structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics. 2012 Sep 15;28(18):i333-i339. doi:10.1093/bioinformatics/bts378. PMID: 22962449; PMCID: PMC3436805.

GIP github page: https://github.com/giovannibussotti/GIP

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.7 LTS
## 
## Matrix products: default
## BLAS:   /usr/local/lib/R/lib/libRblas.so
## LAPACK: /usr/local/lib/R/lib/libRlapack.so
## 
## locale:
## [1] C
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.29
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.0  magrittr_1.5    tools_3.6.0     htmltools_0.5.0
##  [5] yaml_2.2.1      stringi_1.4.6   rmarkdown_2.3   highr_0.8      
##  [9] stringr_1.4.0   xfun_0.16       digest_0.6.25   mime_0.9       
## [13] rlang_0.4.7     evaluate_0.14